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ABSTRACT 


In this paper we present the derivation of Canonical Workflow Modules from current workflows in 
simulation-based climate science in support of the elaboration of a corresponding framework for simulation- 
based research. We first identified the different users and user groups in simulation-based climate science 
based on their reasons for using the resources provided at the German Climate Computing Center (DKRZ). 
What is special about this is that the DKRZ provides the climate science community with resources like high 
performance computing (HPC), data storage and specialised services, and hosts the World Data Center for 
Climate (WDCC). Therefore, users can perform their entire research workflows up to the publication of the 
data on the same infrastructure. Our analysis shows, that the resources are used by two primary user types: 
those who require the HPC-system to perform resource intensive simulations to subsequently analyse them 
and those who reuse, build-on and analyse existing data. We then further subdivided these top-level user 
categories based on their specific goals and analysed their typical, idealised workflows applied to achieve 
the respective project goals. We find that due to the subdivision and further granulation of the user groups, 
the workflows show apparent differences. Nevertheless, similar “Canonical Workflow Modules” can be 
clearly made out. These modules are “Data and Software (Re)use”, “Compute”, “Data and Software Storing”, 
“Data and Software Publication”, “Generating Knowledge” and in their entirety form the basis for a Canonical 
Workflow Framework for Research (CWFR). It is desirable that parts of the workflows in a CWFR act as FDOs, 
but we view this aspect critically. Also, we reflect on the question whether the derivation of Canonical Workflow 
modules from the analysis of current user behaviour still holds for future systems and work processes. 


* Corresponding author: Ivonne Anders (Email: anders@dkrz.de; ORCID: 0000-0001-7337-3009); Karsten Peters-von Gehlen 
(Email: peters@dkrz.de; ORCID: 0000-0003-0158-2957). 
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1. INTRODUCTION 


In addition to long-term meteorological measurements, results obtained from simulations performed with 
climate models form the main basis for research and statements on past and possible future global, regional 
and local climate. These models are based on mathematical equations that express fundamental physical 
relationships and laws, such as the laws of conservation of mass, momentum, and energy [e.g., 1, 2, 3]. 
Running these models e.g., at high spatial resolution is computationally expensive and requires high- 
performance computing (HPC) infrastructure. Depending on the experiment, the data volume for simulations 
over long periods of time and/or at high spatial resolution can reach sizes of TeraBytes (TB) to PetaBytes 
(PB) [4, 5]. Therefore, the geoscience community is working to develop and establish effective and 
reproducible scientific workflows [6]. The aim is to enable researchers to spend more time on the actual 
scientific work [7]. Furthermore, for example, several people or teams often work in a project of coordinated 
climate simulations in which tasks are shared. This once again highlights the need for standardised workflows. 


With the help of climate models, it is possible to investigate and determine interactions of the individual 
components of the climate system, but they are also input data for impact models, i.e., calculations on 
how climate change can affect ecosystems, urban development or various other systems [e.g., 8, 9, 10]. 
Climate models and climate data are the major components of different disciplines of climate science and 
they form the basis for assessing the risks and opportunities of future climate change and for developing 
adaptation measures [e.g., 11, 12, 13]. Various statistical methods are used for the scientific evaluation, but 
also for the processing and use of climate model data [e.g., 14, 15, 16]. Further, the ever increasing data 
amount and complexity of analysis workflows has spawned the development of novel data access and 
analysis infrastructures [e.g., 17, 18, 19]. Upcoming major funding initiatives facilitating a step-change in 
climate science already cast their shadows [e.g., 20] and will allow for the major investments in infrastructure 
needed to transform the way simulation-based climate sciences is performed [21]. 


High Perfomance Computing (HPC) Centers, like the German Climate Computing Center (DKRZ), are 
major partners for climate research, as they provide essential infrastructure, like HPC resources, data storage 
and tailored services to support simulation-based climate science. It should be pointed out here that climate 
scientists can and do perform the entire suite of their data intensive workflows using the DKRZ infrastructure 
and services—ranging from planning and settingup of model simulations, analyzing the model output, 
reusing existing large-volume datasets to data publication and long-term archival. This allows to analyse 
DKRZ user behavior for the purpose of devising generalized analysis workflows in simulation-based climate 
science. These are then amenable to further abstraction to support the development, adaptation and 
dissemination of the Canonical Workflows For Research (CWFR) [22] concept. 


In the first step, we present the results of our user analysis. Based on this, we present the CWFR modules 
we derived from it. We then take a critical look at the use of FAIR Digitial Objects (FDOs) [23, 24] in the 
CWFR process with its possibilities and limitations. Finally, we summarise and give first solutions. 
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2. TYPICAL WORKFLOWS IN SIMULATION-BASED CLIMATE SCIENCE 
2.1 User Types Analysis 


As a basis for deriving CWFR modules in simulation-based climate science, we first analysed the current 
workflows of DKRZ users. We distinguish two main types of DKRZ users: the modellers and the data 
(re)users. However, it is also possible that the modeller switches roles to become a data (re)user. Modellers 
are further divided into modellers with little to no data use in the modelling process itself and modellers 
with unconditional data use. These are those users who run models requiring existing input data. Both 
subgroups of modellers are further subdivided into modellers pursuing a scientific goal, model developers 
and modellers performing production runs. The latter two groups usually do not pursue a scientific objective. 
Of course, there are also users here who nevertheless pursue a research goal afterwards, i.e., change their role. 


The group of data (re)users is divided into three subgroups: researchers pursuing a scientific goal, (climate) 
service providers and impact modellers. 


pursuing a scientific goal 
developing the model 
little to production runs 
no data use 


pursuing a scientific goal 
developing the model 


unconditional production runs 


modeler data use 
data 
(re)user pursuing a scientific goal 


impact modeler 
service provider 
publisher 


Figure 1. User groups and further subdivisions. 


It should be noted that the above presents an idealised view of the DKRZ user groups. Each group has 
been analysed in detail. In each group there are sub-groups that may not carry out parts of the workflows or 
may carry them out repeatedly. This can lead to very complex, interlocking and interdependent workflows. 


2.2 Detailed Workflow Examples 


In the following, we illustrate the idealised workflows of just two specific user groups for the sake of 
brevity: climate modellers with unconditional use of existing data in the modelling process itself and data 
re-users. In each of the described cases, both users pursue a scientific goal. 
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Example 1: Climate Modeller pursuing a scientific question 


The climate modeller in this example (Figure 2) (re)uses data to drive the climate model. This data is 
either already present in the local infrastructure or is copied from external resources (Step 1 in Figure 2). 


os l 


reuse input climate created analyse results out publish knowledge 
data model model model of analysis data for 

simulations output results reuse 

at HPC- with reuse 

system of data 


Figure 2. Idealised linear workflow of a climate modeller pursuing a scientific question. Steps from left to right: 
reusing existing data, carrying out model simulations at HPC-system, storing model output data, analysing model 
results and comparing existing data (reuse), storing output data from analysis, publishing model output data and 
analyzing data in the repository, and generating knowledge; light grey boxes indicate the converting processes. 


The data is then converted (e.g., by an input model) to comply with the input format required by the 
climate model. Before executing the model simulation, the model has to be adapted and compiled according 
to the HPC system specifications. In the example shown, the scientist runs several model simulations with 
the same input data but different model configurations. Output data is obtained for each model run. The 
model needs a converter to write the output data in certain formats. In subsequent analyses and calculations 
on a computer server, the scientist compares the data of the different simulations, but also to other already 
existing datasets, e.g., observation data. This comparative data usually also requires reformatting (converting) 
prior to analysis. Eventually, new data is created, stored, and, like the output data of the simulations, 
published. A converter adapting the data to the standards required by the repository is usually required as 
well. The workflow is finalised with answering the scientific question and generation of new knowledge. 


The scheme shown in Figure 2 would look identical if not one model but different models were used, 
as for example in a Model Intercomparison Project like e.g., CMIP6 [4]. 


Data Intelligence 215 


202211.00444v1 


chinaXiv 


ChinaX iv (ERAT 


Canonical Workflows in Simulation-based Climate Sciences 


Example 2: Data (re)user for reasons of analysis 


The workflow of a scientist reusing existing data and pursuing a scientific question is shown in Figure 3. 
The data either exists in the local infrastructure or is copied from external sources (light orange object and 
dashed line in Figure 3). The data, usually available following different (meta)data formats/standards are 
read-in for analysis by specific routines. They work as converter. Reuse of such routines or converter is 
uncommon, i.e., almost every scientist conceives an individual solution. Next, the prepared data are 
analysed by the scientist. For this, the scientist employs the local software environment and computing 
infrastructure, e.g., analysis servers or even HPC if massive parallel operations are needed. As a result of 
the analyses, new data is generated, stored and published. Another converter is used to adapt the data to 
the standards required by the repository. The scientist answers the scientific question and thus generates 
new knowledge by e.g., summarizing the research results in a scientific publication. 


eS ee ee 


reuse input analyse results publish knowledge 
data from data out of data for 

different analysis reuse 

sources 


Figure 3. Idealised linear workflow of data reuse following a research question. Parts from left to right: reusing 
existing data, analysing and comparing existing data (reuse), storing output data from analysis, publishing analysis 
data in the repository, and generating knowledge; light grey boxes indicate the converting processes. 


3. DERIVING CANONICAL WORKFLOWS FROM USER BEHAVIOUR 


The two illustrated workflows show apparent differences, but also similarities. Specifically, we see that 
steps 4 to 7 in example 1 (Figure 2) are identical to steps 2 to 5 in example 2 (Figure 3). Therefore, 
“Canonical Workflow Modules” (as we call them) can be clearly made out and are amenable to abstraction. 


The modules we define are “Data and Software (Re)use”, “Compute”, “Data and Software Storing”, “Data 
and Software Publication”, “Generating Knowledge” (compare Figure 4). In the specific workflow, these 
modules can be repeated individually or in complex interaction. However, individual modules can be 
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omitted or not run through. Between the Canonical Workflow modules, “Converters” must be used, which 
can be Canonical Workflows themselves or FAIR Digital Objects [23, 24]. 


(Re) Use input data /s 


generale 


import to store ot knowledge 
DKRZ OKRZ 


Figure 4. Canonical Workflow modules and identified submodules. 


3.1 Data and Software (Re)use 


Data and software (re)use is an important Canonical Workflow module. Data orange is accessed either 
directly at the local computing infrastructure or obtained from a repository/archive. In the latter case, 
additional steps are needed to either copy the data to the computing infrastructure or access and process 
it remotely. In the modelling process, input data must conform to a certain form or standard (see also 
Section 3.6). 


Software sharing is still not very established and can be challenging [25]. Specifically, sharing of model 
code is not common practice across institutions in climate science [26] due to intricate licensing frameworks 
and non-trivial code adaptations necessary to apply it on different computing infrastructures. 


3.2 Compute 


The CWFR module “Compute” incorporates all kinds of computing infrastructure use. We distinguish 
between two types of infrastructure: HPC and Analysis Servers. HPC environments are used for performing 
model simulations, whereas Analysis Servers are used for pre- and postprocessing of data as well as for 
analysis. Of course, very computationally intensive analyses, like deep learning approaches, must also be 
performed on an HPC system. The “Compute” module also includes a distinction between the use of local 
or external infrastructure, which has an impact on the used converters (compare Section 3.6). 


3.3 Data and Software Storing 


As described in the examples of scientific user workflows, output data but also new software (new parts 
in the model code or analysis software) are created in the process. Writing and storing data and software 
is thus another potential CWFR module. Again, we distinguish between the extent to which data is stored 
close to the user’s computer infrastructure or transferred to external resources. 


3.4 Data and Software Publication 


The process of data or software publication is a complex, as many aspects have to be taken into account. 
Institutional definitions of standards and licences have to be reconciled with those of the funders and the 
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repositories. Software is subject to a different legal framework than data. There are currently discussions 
within the RDA [27], but also at the national level, on how to deal with research software publication to 
increase the awareness for its relevance [28] and provide guidance. In the context of publication, converters 
play a decisive role because they are used for the mapping of metadata. 


From a less technical perspective, it should be noted that publishing data is most often only carried out 
when requested by journals, funders or institutions. This is because data publication (and sharing) is currently 
still not considered from the beginning of the research project and that the data preparation and publication 
process is considered too time-consuming given the lack of evident scholarly benefits [29, 30, 31, 32]. The 
integration of the publication process into the CWFR framework is therefore crucial to incentivise increased 
data publication. 


3.5 Generating Knowledge 


Achieving new knowledge is not a workflow step in the sense that it leads to technical implementation. 
Nevertheless, we include it in the workflow chain because it is an important link between research phases 
in the scientific cycle and plays a role in the further development of theories and solutions. In addition to 
the pure gain of knowledge, the dissemination, i.e., the sharing of knowledge with other scientists, can also 
be considered here, for example through a scientific publication or through the presentation of the results 
at a scientific conference. 


3.6 Converter 


Converters are very important elements in the entire workflow and therefore in a CWER. They occupy a 
special position because they are not bound to the individual research workflow, but can be used and 
reused in a variety of generalised ways as described in Section 2. In Figures 2 and 3 they are shown in 
light grey. Currently, they are individually generated and applied by the scientists themselves, which takes 
valuable time that should be more usefully spent on the scientific question. Therefore, converters offer the 
greatest potential for technical standardisation to make them more transparent and act as FDOs (see also 
Section 4). 


Converters themselves can be individual programmes, but also more complex processes consisting of 
several technical elements. For example, the process of publishing consists of transforming metadata 
standards, adapting data formats, assigning PID, creating entries in repository databases, etc. 


4. FDOS IN CWERS OF DATA INTENSIVE CLIMATE SCIENCE 


CWFRs are the overarching constructs containing the identified workflow modules (Section 3). In these, 
FDOs are the data products resulting from CWFR modules, as well as the converters and the routines used 
in the CWFR modules (tools). The workflow itself is also an FDO, thereby facilitating reuse and reproducibility. 
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In simulation-based climate science, the use of FDOs in every step of the entire process chain would be 
groundbreaking and the advantages are obvious: 


e Provenance tracking and full reproducibility: For each FDO, complete provenance information is 
stored (in the PID profile and in the metadata). Source data, converters, processing software or model 
and model specifications, which are also specified as FDOs, are indexed. Ultimately, the last FDO 
in the workflow contains the information of the entire workflow. If software is viewed as an FDO, 
this object contains not only the usual metadata such as authors and licences, but also all information 
on the infrastructure, compiler and compiler specifications. 

e Creation of globally federated data bases: FDOs have a unique globally accessible PID, allowing for 
global access and data (re)use. Globally federated databases of FDOs building on existing infrastrucutre 
concepts [e.g., 17] would thus be achieved. 

e Supporting BigData approaches: One part of the BigData approaches is the combined use of 
heterogenic data. To achieve this, data must be standardised and described in terms of its heterogeneity 
to faciliated machine-actionability. FDOs have a decisive role here. 

e Automated selection of analysis tools and libraries: With regard to non-expert users, interfaces which 
automatically select analysis tools and libraries in the background are necessary for a desired workflow 
to run through. These interfaces can also be designed as FDOs. 


However, the full implementation of the CWFR concept in climate science would be challenging 
(Section 5.2). 


5. DISCUSSION 
5.1 Status Quo Workflow vs. Future Workflows 


We have identified Canonical workflow modules based on current user behaviour, but do they hold up 
for the future? The modules are kept so that they continue to exist as such. However, it is expected that 
users will not actively engage with the implementation or processing of certain modules in the near to 
distant future. This is especially true for the Data and Software Storing and Publication parts, including the 
standardisation of metadata. However, we expect that with the further establishment of globally federated 
data bases, the distinction between “local” and “external” will technically no longer make a difference. 
The aspect will remain, but the question is how the information of the overall workflow will be preserved. 
A system is needed to log individual location-independent workflow steps and record them in FDOs. 


5.2 Critical View on the Use of FDOs in Simulation Based Climate Science 


Numerical models are the essential component of simulation-based climate science. The model code is 
often not freely available and is subject to strict licensing and usage agreements. Only a few models are 
freely accessible in their source code ([26]) and can thus really be declared as FDO. 
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Further, simulation results depend on the infrastructure used for the simulation ([33]). HPC systems are 
usually replaced after 4-6 years. With the discontinuation of the infrastructure used, the data are also no 
longer reproducible. This would mean that indexing of the model and model setup as a FDO would be 
possible, but no longer repeatedly executable. To overcome this issue, efforts to establish the use of workflow 
tools and container approaches in HPC environments with concrete applications in simulation-based 
climate science are being considered [e.g., 34, 35, 15, and references therein]. Storing the workflows 
themselves instead of data would also be a very good approach in terms of saving resources (storage space). 


In order to be able to really introduce FDOs in the area of software (model and analysis tools), decisive 
prerequisites must be created, e.g., git commit. Furthermore, there are currently no standards for PID 
profiles. Without standardisation in this area, the introduction and use of FDOs is very difficult, as PID 
profiles form an essential basis for the FAIRness of data objects and contain important information about 
the data or software object. 


6. SUMMARY AND OUTLOOK 


We have used the example of DKRZ user behaviour to devise building blocks of CWFRs. DKRZ provides 
users with computing power and storage capacity but is also a repository for climate data (model and 
observation data), which enables data publication but also direct data reuse. Thus, scientists can carry out 
all the steps of their research work in one overall system. This makes it possible to develop and implement 
the CWER framework in an interlocked manner. 


We identified different users and user groups based on their reasons for using the supplied resources and 
then examined the workflows they apply in their work. Our analysis shows, that the resources are used by 
two primary user types: those who require the HPC-system to perform resource intensive simulations to 
subsequently analyse them and those who reuse, build-on and analyse existing data stored either on local 
file systems or reuse data from external resources. We found that there are superordinate CWFR modules 
that several user groups use in the same way or that are repeated again and again. These offer the potential 
for automation or the use of FDOs. However, we have also taken a critical look at the use of FDOs in this 
subject area as well as the possible future changes to workflows. Because climate science is rapidly 
proceeding towards exa-scale computing [20], current scientific workflows will soon not be fit-for-purpose 
anymore [21]. Furthermore, the reproducibility of simulation results is very limited due to constant changes 
in HPC infrastructures [33, 25]. 


We are currently facing up to the challenges: The Free Evaluation System Framework for Earth System 
Modelling [Freva, 18] is being further developed at DKRZ. Freva can be used to orchestrate all steps 
of scientific workflows necessary to perform research in data intensive climate science. It builds on a 
standardised data base, provides a programming interface, tracks provenance information and can be run 
from the command line or a web interface. Freva is currently undergoing massive updates and improvements, 
like direct data access to repositories or handling numerical model simulations, to support future research 
projects requiring a step-change in the way climate science is performed. 
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Overall, the complexity of the individual CWFR modules we identified here poses a great challenge, 
as it is not sufficient to provide individual solutions, but to always keep an eye on the interaction of all 
components. 


Looking at the defined Canonical Workflow Modules and corresponding sub-modules, one finds that 
they are transferable to all disciplines that follow simulation-based approaches (data-, simulation- and/or 
analysis-intensive disciplines). These are not only fields from physics or chemistry, but also political science, 
economic research, etc. 
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